xkcd 1313: Regex Golf

Peter Norvig
January 2014
revised November 2015

I ♡ xkcd! It reliably provides top-rate insights, humor, or both. I was thrilled when I got to introduce Randall Monroe for a talk in 2007. But in xkcd #1313,

I found that the hover text, "/bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls/ matches the last names of elected US presidents but not their opponents", contains a confusing contradiction. I'm old enough to remember that Jimmy Carter won one term and lost a second. No regular expression could both match and not match "Carter".

But this got me thinking: can I come up with an algorithm to match or beat Randall's regex golf scores? The game is on.

Presidents

I started by finding a listing of presidential elections, giving me these winners and losers:


In [1]:
from __future__ import division, print_function
import re
import itertools

def words(text): return set(text.split())

winners = words('''washington adams jefferson jefferson madison madison monroe 
    monroe adams jackson jackson van-buren harrison polk taylor pierce buchanan 
    lincoln lincoln grant grapartnt hayes garfield cleveland harrison cleveland mckinley
    mckinley roosevelt taft wilson wilson harding coolidge hoover roosevelt 
    roosevelt roosevelt roosevelt truman eisenhower eisenhower kennedy johnson nixon 
    nixon carter reagan reagan bush clinton clinton bush bush obama obama''')

losers = words('''clinton jefferson adams pinckney pinckney clinton king adams 
    jackson adams clay van-buren van-buren clay cass scott fremont breckinridge 
    mcclellan seymour greeley tilden hancock blaine cleveland harrison bryan bryan 
    parker bryan roosevelt hughes cox davis smith hoover landon willkie dewey dewey 
    stevenson stevenson nixon goldwater humphrey mcgovern ford carter mondale 
    dukakis bush dole gore kerry mccain romney''')

We can see that there are multiple names that are both winners and losers:


In [2]:
winners & losers


Out[2]:
{'adams',
 'bush',
 'carter',
 'cleveland',
 'clinton',
 'harrison',
 'hoover',
 'jackson',
 'jefferson',
 'nixon',
 'roosevelt',
 'van-buren'}

Clinton? He won both his elections, didn't he? Yes, Bill Clinton did, but George Clinton (the Revolutionary War leader, not the Funkadelic leader) was a losing opponent in 1792 and 1812. To avoid the contradiction, I decided to eliminate all winners from the set of losers (and in fact Randall later confirmed that that was his intent):


In [3]:
losers = losers - winners

Defining the Problem

Let's be clear exactly what we're trying to achieve. We're looking for a Python regular expression which, when used with the re.search function, will match all of the winners but none of the losers. We can define the function mistakes to return a set of misclassifications; if mistakes is empty then the regular expression is verified correct:


In [4]:
def mistakes(regex, winners, losers):
    "The set of mistakes made by this regex in classifying winners and losers."
    return ({"Should have matched: " + W 
             for W in winners if not re.search(regex, W)} |
            {"Should not have matched: " + L 
             for L in losers if re.search(regex, L)})

def verify(regex, winners, losers): 
    assert not mistakes(regex, winners, losers)
    return True

Let's check the xkcd regex:


In [5]:
xharag01 = "[gikuj]..n|a.[alt]|[pivo].l|i..o|[jocy]e|sh|di|oo"

mistakes(xkcd, winners, losers)


Out[5]:
{'Should not have matched: fremont'}

The xkcd regex incorrectly matches "fremont", representing John C. Frémont, the Republican candidate who lost to James Buchanan in 1856. Could Randall Monroe have made an error? Is someone wrong on the Internet? Investigating the 1856 election, I see that Randall must have had Millard Fillmore, the third-party candidate, as the opponent. Fillmore is more famous, having served as the 13th president (although he never won an election; he became president when Taylor died in office). But Fillmore only got 8 electoral votes in 1856 while Fremont got 114, so I will stick with Fremont in my list of losers.

We can verify that Randall got it right under the interpretation that Fillmore, not Fremont, was the loser:


In [6]:
alternative_losers = {'fillmore'} | losers - {'fremont'}

verify(xkcd, winners, alternative_losers)


Out[6]:
True

Strategy for Finding a Regex

We need a strategy to find a regex that matches all the winners but none of the losers. I came up with this approach:

  • Generate a pool of regex parts: small regexes of a few characters, such as "bu" or "r.e$" or "j".
  • Consider only parts that match at least one winner, but no losers.
  • Our solution will be formed by "or"-ing together some of these parts (e.g. "bu|n.e|j").
  • This is a set cover problem: pick some of the parts so that they cover all the winners.
  • Set cover is an NP-hard problem, so I feel justified in using an approximation approach that finds a small but not necessarily smallest solution.
  • For many NP-hard problems a good approximation can be had with a greedy algorithm: Pick the "best" part first (the one that covers the most winners with the fewest characters), and repeat, choosing the "best" each time until there are no more winners to cover.
  • To guarantee that we will find a solution, make sure that each winner has at least one part that matches it.

There are three ways this strategy can fail to find the shortest possible regex:

  • The shortest regex might not be a disjunction. Our strategy can only find disjunctions (of the form "a|b|c|...").
  • The shortest regex might be a disjunction formed with different parts. For example, "[rn]t" is not in our pool of parts.
  • The greedy algorithm isn't guaranteed to find the shortest solution. We might have all the right parts, but pick the wrong ones.

The algorithm is below. Our pool of parts is a set of strings created with regex_parts(winners, losers). We accumulate parts into the list solution, which starts empty. On each iteration choose the best part: the one with a maximum score. (I decided by default to score 4 points for each winner matched, minus one point for each character in the part.) We then add the best part to solution, and remove from winners all the strings that are matched by best. Finally, we update the pool, keeping only those parts that still match one or more of the remaining winners. When there are no more winners left, OR together all the solution parts to give the final regular expression string.


In [7]:
def findregex(winners, losers, k=4):
    "Find a regex that matches all winners but no losers (sets of strings)."
    # Make a pool of regex parts, then pick from them to cover winners.
    # On each iteration, add the 'best' part to 'solution',
    # remove winners covered by best, and keep in 'pool' only parts
    # that still match some winner.
    pool = regex_parts(winners, losers)
    solution = []
    def score(part): return k * len(matches(part, winners)) - len(part)
    while winners:
        best = max(pool, key=score)
        solution.append(best)
        winners = winners - matches(best, winners)
        pool = {r for r in pool if matches(r, winners)}
    return OR(solution)

def matches(regex, strings):
    "Return a set of all the strings that are matched by regex."
    return {s for s in strings if re.search(regex, s)}

OR = '|'.join # Join a sequence of strings with '|' between them

Glossary

Just to be clear, I define the terms I will be using:

  • winners: A set of strings; our solution is required to match each of them.
  • losers: A set of strings; our solution is not allowed to match any of them.
  • part: A small regular expression, a string, such as 'bu' or 'a.a'.
  • pool: A set of parts from which we will pick a subset to form the solution.
  • regex: A regular expression; a pattern used to match against a string.
  • solution: A regular expression that matches all winners but no losers.
  • whole: A part that matches a whole word (and nothing else): '^word$'

Regex Parts

Now we need to define what the regex_parts are. Here's what I came up with:

  • For each winner, include a regex that matches the entire string exactly. I call this regex a whole.
    Example: for 'word', include '^word$'
  • For each whole, generate subparts consisting of 1 to 4 consecutive characters.
    Example: subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
  • For each subpart, generate all ways to replace any of the letters with a dot (the "match any" character).
    Example: dotify('it') == {'it', 'i.', '.t', '..'}
  • Keep only the dotified subparts that do not match any of the losers.

Note that I only used a few of the regular expression mechanisms: '.', '^', and '$'. I didn't try to use character classes ([a-z]), nor any of the repetition operators, nor other advanced mechanisms. Why? I thought that the advanced features usually take too many characters. For example, I don't allow the part '[rn]t', but I can achieve the same effect with the same number of characters by combining two parts: 'rt|nt'. I could add more complicated mechanisms later, but for now, YAGNI. Here is the code:


In [8]:
def regex_parts(winners, losers):
    "Return parts that match at least one winner, but no loser."
    wholes = {'^' + w + '$'  for w in winners}
    parts = {d for w in wholes for p in subparts(w) for d in dotify(p)}
    return wholes | {p for p in parts if not matches(p, losers)}

def subparts(word, N=4):
    "Return a set of subparts of word: consecutive characters up to length N (default 4)."
    return set(word[i:i+n+1] for i in range(len(word)) for n in range(N)) 
    
def dotify(part):
    "Return all ways to replace a subset of chars in part with '.'."
    choices = map(replacements, part)
    return {cat(chars) for chars in itertools.product(*choices)}

def replacements(c): return c if c in '^$' else c + '.'

cat = ''.join

Our program is complete! We can run findregex, verify the solution, and compare the length of our solution to Randall's:


In [9]:
solution = findregex(winners, losers)
verify(solution, winners, losers)

len(solution), solution


Out[9]:
(53, 'a.a|i..n|j|oo|a.t|i..o|a..i|bu|n.e|ay.|r.e$|po|ma|nd$')

In [10]:
len(xkcd), xkcd


Out[10]:
(63, 'bu|[rn]t|[coy]e|[mtg]a|j|iso|n[hl]|[ae]d|lev|sh|[lnd]i|[po]o|ls')

Our regex is 15% shorter than Randall's—success!

Tests

Here's a test suite to give us more confidence in (and familiarity with) our functions:


In [11]:
def tests():
    assert subparts('^it$') == {'^', 'i', 't', '$', '^i', 'it', 't$', '^it', 'it$', '^it$'}
    assert subparts('this') == {'t', 'h', 'i', 's', 'th', 'hi', 'is', 'thi', 'his', 'this'}
    subparts('banana') == {'a', 'an', 'ana', 'anan', 'b', 'ba', 'ban', 'bana', 
                           'n', 'na', 'nan', 'nana'}
    
    assert dotify('it') == {'it', 'i.', '.t', '..'}
    assert dotify('^it$') == {'^it$', '^i.$', '^.t$', '^..$'}
    assert dotify('this') == {'this', 'thi.', 'th.s', 'th..', 't.is', 't.i.', 't..s', 't...',
                              '.his', '.hi.', '.h.s', '.h..', '..is', '..i.', '...s', '....'}
    assert regex_parts({'win'}, {'losers', 'bin', 'won'}) == {
        '^win$', '^win', '^wi.', 'wi.',  'wi', '^wi', 'win$', 'win', 'wi.$'}
    assert regex_parts({'win'}, {'bin', 'won', 'wine', 'wit'}) == {'^win$', 'win$'}
    regex_parts({'boy', 'coy'}, 
                {'ahoy', 'toy', 'book', 'cook', 'boycott', 'cowboy', 'cod', 'buy', 'oy', 
                 'foil', 'coyote'}) == {'^boy$', '^coy$', 'c.y$', 'coy$'}
    
    assert matches('a|b|c', {'a', 'b', 'c', 'd', 'e'}) == {'a', 'b', 'c'}
    assert matches('a|b|c', {'any', 'bee', 'succeed', 'dee', 'eee!'}) == {
        'any', 'bee', 'succeed'}
    
    assert OR(['a', 'b', 'c']) == 'a|b|c'
    assert OR(['a']) == 'a'
    
    assert words('this is a test this is') == {'this', 'is', 'a', 'test'}
    
    assert findregex({"ahahah", "ciao"},  {"ahaha", "bye"}) == 'a.$'
    assert findregex({"this", "that", "the other"}, {"one", "two", "here", "there"}) == 'h..$'
    assert findregex({'boy', 'coy', 'toy', 'joy'}, {'ahoy', 'buy', 'oy', 'foil'}) == '^.oy'
    
    assert not mistakes('a|b|c', {'ahoy', 'boy', 'coy'}, {'joy', 'toy'})
    assert not mistakes('^a|^b|^c', {'ahoy', 'boy', 'coy'}, {'joy', 'toy', 'kickback'})
    assert mistakes('^.oy', {'ahoy', 'boy', 'coy'}, {'joy', 'ploy'}) == {
        "Should have matched: ahoy", 
        "Should not have matched: joy"}
    return 'tests pass'

tests()


Out[11]:
'tests pass'

Regex Golf with Arbitrary Lists

Let's move on to arbitrary lists. I define report, to call findregex, verify the solution, and print the number of characters in the solution, the number of parts, the competitive ratio (the ratio between the lengths of a trivial solution and the actual solution), and the number of winners and losers.


In [12]:
def report(winners, losers):
    "Find a regex to match A but not B, and vice-versa.  Print summary."
    solution = findregex(winners, losers)
    verify(solution, winners, losers)
    trivial  = '^(' + OR(winners) + ')$'
    print('Characters: {}, Parts: {}, Competitive ratio: {:.1f}, Winners: {}, Losers: {}'.format(
          len(solution), solution.count('|') + 1, len(trivial) / len(solution) , len(winners), len(losers)))
    return solution

report(winners, losers)


Characters: 53, Parts: 14, Competitive ratio: 5.2, Winners: 35, Losers: 34
Out[12]:
'a.a|i..n|j|oo|a.t|i..o|a..i|bu|n.e|ay.|r.e$|po|ma|nd$'

The top 10 boys and girls names for 2012:


In [13]:
boys  = words('jacob mason ethan noah william liam jayden michael alexander aiden')
girls = words('sophia emma isabella olivia ava emily abigail mia madison elizabeth')

report(boys, girls)


Characters: 11, Parts: 3, Competitive ratio: 6.4, Winners: 10, Losers: 10
Out[13]:
'e.$|a.$|a.o'

This is interesting because 'a.$|e.$|a.o' is an example of something that could be re-written in a shorter form if we allowed more complex parts. The following would save one character:


In [14]:
verify('[ae].(o|$)', boys, girls)


Out[14]:
True

We have now fulfilled panel two of the strip. Let's try another example, separating the top ten best-selling drugs from the top 10 cities to visit:


In [15]:
drugs = words('lipitor nexium plavix advair ablify seroquel singulair crestor actos epogen')
cities = words('paris trinidad capetown riga zurich shanghai vancouver chicago adelaide auckland')

report(drugs, cities)


Characters: 15, Parts: 6, Competitive ratio: 5.3, Winners: 10, Losers: 10
Out[15]:
'o.$|x|ir|q|f|po'

We can answer the challenge from panel one of the strip:


In [16]:
def phrases(text, sep='/'): return {line.upper().strip() for line in text.split(sep)}

starwars = phrases('''The Phantom Menace / Attack of the Clones / Revenge of the Sith /
                      A New Hope / The Empire Strikes Back / Return of the Jedi''')

startrek = phrases('''The Wrath of Khan / The Search for Spock / The Voyage Home /
                      The Final Frontier / The Undiscovered Country / Generations / 
                      First Contact / Insurrection / Nemesis''')

report(starwars, startrek)


Characters: 9, Parts: 3, Competitive ratio: 13.0, Winners: 6, Losers: 9
Out[16]:
' T|E.P|OP'

Neat—our solution is one character shorter than Randall's. We can verify that Randall's solution is also correct:


In [17]:
verify('M | [TN]|B', starwars, startrek)


Out[17]:
True

Update (Nov 2015): There are two new movies in the works!

Let's add them:


In [18]:
starwars.add('THE FORCE AWAKENS')
startrek.add('BEYOND')

findregex(starwars, startrek)


Out[18]:
' T|CE| ..P'

The two movies cost us one more character.

There are lots of examples to play with over at regex.alf.nu, like this one:


In [19]:
foo = words('''afoot catfoot dogfoot fanfoot foody foolery foolish fooster 
    footage foothot footle footpad footway hotfoot jawfoot mafoo nonfood padfoot 
    prefool sfoot unfool''')

bar = words('''Atlas Aymoro Iberic Mahran Ormazd Silipan altared chandoo crenel 
    crooked fardo folksy forest hebamic idgah manlike marly palazzi sixfold 
    tarrock unfold''')

report(foo, bar)


Characters: 3, Parts: 1, Competitive ratio: 53.7, Winners: 21, Losers: 21
Out[19]:
'foo'

The answer varies with different runs; sometimes it is 'foo' and sometimes 'f.o'. Both have 3 characters, but 'f.o' is smaller in terms of the total amount of ink/pixels needed to render it. (How can the answer vary, when there are no calls to any random function? Because when max iterates over a set and several elements have the same best score, it is unspecified which one will be selected.)

Of course, we can run any of these examples in the other direction:


In [20]:
report(bar, foo)


Characters: 26, Parts: 8, Competitive ratio: 6.0, Winners: 21, Losers: 21
Out[20]:
'r..$|k|.m|...n|ld|la|dg|or'

What To Do Next?

I see two options:

  • Stop here and declare victory! Yay!
  • Try to make the program faster and capable of finding shorter regexes.

My first inclination was "stop here", and that's what this notebook will shortly do. But several correspondents offered very interesting suggestions, so I returned to the problem in a second notebook.

I was asked whether Randall was wrong to come up with "only" a 10-character Star Wars regex, whereas I showed there is a 9-character version. I would say that, given his role as a cartoonist, author, public speaker, educator, and entertainer, he has chosen ... wisely. He wrote a program that was good enough to allow him to make a great webcomic. A 9-character regex would not improve the comic. Randall stated that he used a genetic algorithm to find his regexes, and it has been said that genetic algorithms are often the second (or was it the third?) best method to solve any problem, and that's all he needed. But if you consider that in addition to all those roles, Randall is also still a practicing computer scientist, you could say he chose ... poorly. Genetic algorithms are good when you want to combine the structure of two solutions to yield a better solution, so they would work well if the best regexes had a complicated tree structure. But they don't! The best solutions are disjunctions of small parts. So the genetic algorithm is trying to combine the first half of one disjunction with the second half of another—but that isn't useful, because the components of a disjunction are unordered; imposing an ordering on them doesn't help.

Summary

That was fun! I hope this page gives you an idea of how to think about problems like this. Let's review what we did:

  • Found an interesting problem (in a comic strip) and realized that it would not be hard to program a solution.
  • Wrote the function mistakes to prove that we really understand exactly what the problem is.
  • Came up with an approach: create lots of regex parts, and "or" together a subset of them.
  • Realized that this is an instance of a known problem, set cover.
  • Since set cover is computationally expensive, decide to use a greedy algorithm, which will be efficient (although not optimal).
  • Decided what goes into the pool of regex parts.
  • Implemented an algorithm to greedily pick parts from the pool (the function findregex).
  • Tried the algorithm on some examples.
  • Declared victory!

Thanks!

Thanks especially to Randall Monroe for inspiring me to do this, to regex.alf.nu for inspiring Randall, to Sean Lip for correcting "Wilkie" to "Willkie," and to Davide Canton, Thomas Breuel, and Stefan Pochmann for providing suggestions to improve my code.


Peter Norvig, Jan. 2014


In [22]:
report(words(""" 000000000
 000000003
 000000006
 000000009
 000000012
 000000015
 066990060
 140091876
 173655750
 312440187
 321769005
 368542278
 390259104
 402223947
 443512431
 714541758
 747289572
 819148602
 878531775
 905586303
 9537348"""), words("""000000005
 000000008
 000000010
 000000011
 000000014
 018990130
 112057285
 159747125
 176950268
 259108903
 333162608
 388401457
 477848777
 478621693
 531683939
 704168662
 759282218
 769340942
 851936815
 973816159
 979204403"""))


Characters: 44, Parts: 12, Competitive ratio: 4.8, Winners: 21, Losers: 21
Out[22]:
'.17|4.2|1.4|06|55|009|37|0.2|00$|191|015|003'

In [ ]: